import pandas as pd
kakamana
January 23, 2023
To understand what data preprocessing is all about, you'll take the first steps in your preprocessing journey: exploring data types and dealing with missing data.
This Introduction to Data Preprocessing is part of the Datacamp course: Preprocessing for Machine Learning in Python
This post is part of my data science learning journey through DataCamp
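The examples below operate on a `volunteer` DataFrame loaded with `pd.read_csv`. A minimal, self-contained sketch of that setup, using an inline CSV snippet in place of the real course file (the column names follow the dataset shown below; the actual file has 665 rows and 35 columns):

```python
import io

import pandas as pd

# A tiny stand-in for the NYC volunteer CSV; the real course file is
# loaded the same way, just from a path on disk.
csv_data = io.StringIO(
    "opportunity_id,title,hits,category_desc\n"
    "4996,Volunteers Needed,737,\n"
    "5008,Web designer,22,Strengthening Communities\n"
)
volunteer = pd.read_csv(csv_data)
print(volunteer.shape)  # (2, 4)
```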
We have a dataset of volunteer information from New York City. The dataset has a number of features, but we want to drop any column that has at least 3 missing values.
opportunity_id | content_id | vol_requests | event_time | title | hits | summary | is_priority | category_id | category_desc | ... | end_date_date | status | Latitude | Longitude | Community Board | Community Council | Census Tract | BIN | BBL | NTA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4996 | 37004 | 50 | 0 | Volunteers Needed For Rise Up & Stay Put! Home... | 737 | Building on successful events last summer and ... | NaN | NaN | NaN | ... | July 30 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 5008 | 37036 | 2 | 0 | Web designer | 22 | Build a website for an Afghan business | NaN | 1.0 | Strengthening Communities | ... | February 01 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 5016 | 37143 | 20 | 0 | Urban Adventures - Ice Skating at Lasker Rink | 62 | Please join us and the students from Mott Hall... | NaN | 1.0 | Strengthening Communities | ... | January 29 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 5022 | 37237 | 500 | 0 | Fight global hunger and support women farmers ... | 14 | The Oxfam Action Corps is a group of dedicated... | NaN | 1.0 | Strengthening Communities | ... | March 31 2012 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 5055 | 37425 | 15 | 0 | Stop 'N' Swap | 31 | Stop 'N' Swap reduces NYC's waste by finding n... | NaN | 4.0 | Environment | ... | February 05 2011 | approved | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 35 columns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 opportunity_id 665 non-null int64
1 content_id 665 non-null int64
2 vol_requests 665 non-null int64
3 event_time 665 non-null int64
4 title 665 non-null object
5 hits 665 non-null int64
6 summary 665 non-null object
7 is_priority 62 non-null object
8 category_id 617 non-null float64
9 category_desc 617 non-null object
10 amsl 0 non-null float64
11 amsl_unit 0 non-null float64
12 org_title 665 non-null object
13 org_content_id 665 non-null int64
14 addresses_count 665 non-null int64
15 locality 595 non-null object
16 region 665 non-null object
17 postalcode 659 non-null float64
18 primary_loc 0 non-null float64
19 display_url 665 non-null object
20 recurrence_type 665 non-null object
21 hours 665 non-null int64
22 created_date 665 non-null object
23 last_modified_date 665 non-null object
24 start_date_date 665 non-null object
25 end_date_date 665 non-null object
26 status 665 non-null object
27 Latitude 0 non-null float64
28 Longitude 0 non-null float64
29 Community Board 0 non-null float64
30 Community Council 0 non-null float64
31 Census Tract 0 non-null float64
32 BIN 0 non-null float64
33 BBL 0 non-null float64
34 NTA 0 non-null float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB
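Dropping columns by a missing-value threshold can be done with `DataFrame.dropna` using `axis=1` and `thresh`, which counts *non-null* values: a column with at least 3 missing values has fewer than `len(df) - 2` non-null entries. A minimal sketch on a toy frame (column names are made up):

```python
import numpy as np
import pandas as pd

# Toy frame: "c" has 3 missing values, "b" has 1, "a" has none.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Keep only columns with at most 2 missing values, i.e. at least
# len(df) - 2 non-null entries (thresh counts non-null values).
kept = df.dropna(axis=1, thresh=len(df) - 2)
print(kept.columns.tolist())  # ['a', 'b']
```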
Taking a look at the volunteer dataset again, we want to drop the rows where the category_desc column is missing. We'll do this with boolean indexing: check which values are null, then filter the dataset so that we keep only the rows where category_desc is present.
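A sketch of that boolean-indexing step on a toy frame (the rows are made up):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the volunteer data.
volunteer = pd.DataFrame({
    "title": ["Home repair", "Web designer", "Ice skating"],
    "category_desc": [np.nan, "Strengthening Communities",
                      "Strengthening Communities"],
})

# Boolean indexing: the notnull() mask is True for rows where
# category_desc has a value, and indexing with it keeps only those rows.
volunteer_subset = volunteer[volunteer["category_desc"].notnull()]
print(volunteer_subset.shape)  # (2, 2)
```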
Taking another look at the volunteer dataset from New York City, we want to know what types we'll be working with as we start to do more preprocessing.
opportunity_id int64
content_id int64
vol_requests int64
event_time int64
title object
hits int64
summary object
is_priority object
category_id float64
category_desc object
amsl float64
amsl_unit float64
org_title object
org_content_id int64
addresses_count int64
locality object
region object
postalcode float64
primary_loc float64
display_url object
recurrence_type object
hours int64
created_date object
last_modified_date object
start_date_date object
end_date_date object
status object
Latitude float64
Longitude float64
Community Board float64
Community Council float64
Census Tract float64
BIN float64
BBL float64
NTA float64
dtype: object
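The listing above is what `DataFrame.dtypes` returns. A minimal sketch on a toy frame, showing that numeric-looking strings surface as `object`:

```python
import pandas as pd

# Toy frame: "hits" holds integers stored as strings, so pandas
# reports its dtype as object, while "vol_requests" is numeric.
df = pd.DataFrame({"hits": ["737", "22"], "vol_requests": [50, 2]})
print(df.dtypes)
```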
If you take a look at the volunteer dataset's types, you'll see that the hits column is type object. But if you actually look at the column, you'll see that it consists of integers. Let's convert that column to type int.
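The conversion itself is a one-liner with `Series.astype`. A minimal sketch on a toy frame; note that the resulting integer width depends on the platform (typically int64, but int32 on Windows, which is why the output below shows int32):

```python
import pandas as pd

# Toy frame: "hits" holds integers stored as strings (dtype object).
volunteer = pd.DataFrame({"hits": ["737", "22", "62"]})

# Convert the column to an integer dtype.
volunteer["hits"] = volunteer["hits"].astype("int")
print(volunteer["hits"].dtype)
```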
0 737
1 22
2 62
3 14
4 31
Name: hits, dtype: int64
opportunity_id int64
content_id int64
vol_requests int64
event_time int64
title object
hits int32
summary object
is_priority object
category_id float64
category_desc object
amsl float64
amsl_unit float64
org_title object
org_content_id int64
addresses_count int64
locality object
region object
postalcode float64
primary_loc float64
display_url object
recurrence_type object
hours int64
created_date object
last_modified_date object
start_date_date object
end_date_date object
status object
Latitude float64
Longitude float64
Community Board float64
Community Council float64
Census Tract float64
BIN float64
BBL float64
NTA float64
dtype: object
In the volunteer dataset, we're thinking about trying to predict the category_desc variable using the other features in the dataset. First, though, we need to know the class distribution (and imbalance) for that label.
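`Series.value_counts` is the usual way to check a class distribution. A sketch on a toy set of labels (the counts here are made up; the real distribution appears in the split output below):

```python
import pandas as pd

# Toy labels standing in for the category_desc column.
labels = pd.Series(
    ["Strengthening Communities", "Strengthening Communities",
     "Environment", "Education"],
    name="category_desc",
)

# value_counts tallies each class, sorted by frequency, which makes
# any imbalance immediately visible.
print(labels.value_counts())
```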
We know that the distribution of values in the category_desc column of the volunteer dataset is uneven. If we wanted to train a model to predict category_desc, we would want to train it on a sample of data that is representative of the entire dataset. Stratified sampling is a way to achieve this.
from sklearn.model_selection import train_test_split
# Create a dataset with all columns except category_desc
volunteer_X = volunteer.dropna(subset=['category_desc'], axis=0).drop('category_desc', axis=1)
# Create a category_desc labels dataset
volunteer_y = volunteer.dropna(subset=['category_desc'], axis=0)[['category_desc']]
# Use stratified sampling to split up the dataset according to the volunteer_y dataset
X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)
# Print out the category_desc counts on the training y labels
print(y_train['category_desc'].value_counts())
Strengthening Communities 230
Helping Neighbors in Need 89
Education 69
Health 39
Environment 24
Emergency Preparedness 11
Name: category_desc, dtype: int64
Warning: stratified sampling with train_test_split cannot handle NaN values in the stratify labels, so you need to drop the NaN values before splitting.